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The Fast Fourier Transform is a mainstay of certain numerical techniques for solving fluid 
dynamics problems. The Connection Machine CM-2 is the target for an investigation into the 
design of multidimensional SIMD parallel FFT algorithms for high performance. Critical 
algorithm design issues are discussed, necessary machine performance measurements are 
identified and made, and the performance of the developed FFT programs are measured. Our FFT 
programs are compared to the currently best Cray-2 FFT program. 
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1 Introduction 


As advanced computer systems continue to increase in performance, 
computationally intensive problems that were not feasible to attack are 
becoming more practical. Included in this category are the compressible 
Reynolds averaged Navier-Stokes equations in the. field of computational fluid 
dynamics (CFD). These equations model the flow of viscous, Newtonian fluids 
at subsonic, transonic, and supersonic speeds. 

Extensive work in this area is being done in the Numerical Aerodynamic 
Simulation (NAS) Systems Division at the NASA Ames Research Center. For 
example, detailed transonic flow analysis has been performed on the airframe 
geometry of the McDonnell-Douglas F-16A jet. fighter using- a three 
dimensional grid of 129x33x33 points. The vast memory of Cray-2 
supercomputers (256 MWords), currently employed by the NAS, allows 
implementation of finer grids to provide a more detailed analysis of the flow 
over complex geometries. 

The Fast Fourier Transform (FFT) is a mainstay of certain CFD methods 
because of its use in solving partial differential equations. The FFT is 
computationally intensive for the problem sizes of interest, making efficient 
implementations of FFT algorithms essential. Many investigations have 
considered the FFT on conventional [BRI74] and vector [SWA84] 
architectures. Since the FFT algorithm admits to considerable parallelism, it 
is attractive for computation on a parallel computer. The fine grain, 
massively parallel nature of the Connection Machine 1 model CM-2 make it a 
logical choice for an FFT investigation. Preliminary FFT algorithm design for 
the CM-2 has been explored in [JHJ88]. Detailed analysis of an FFT 
algorithm for another parallel computer that supports various modes of 
parallelism has been done [BCJ89], 

The work presented here is a design and tradeoff study taken through to 
algorithm implementation and algorithm performance measurement. In the 
supercomputing arena, where the highest performance algorithms are sought, 
an algorithm study without implementation and performance measurement is 
unlikely to result in the best algorithm design. The behavior of computers is 
complex, and readily measured quantities such as total number of instructions 
or processor utilization do not necessarily provide guidance for reaching the 
goal of a high performance algorithm. The research began on a CM-2 with 
8192 processors running Version 4.3 of the system software, and continued 
onto the current configuration of 32768 processors with floating point 
hardware running Version 5.0/? system software. The CM-2 is sited at the 
NAS division of the NASA Ames Research Center. 


^Connection Machine is a trademark of Thinking Machines, Inc. 
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2 Background 

2.1 The Fast Fourier Transform 


The discrete Fourier Transform (DFT) was developed for digital 
computer evaluation of the continuous Fourier transform, which relates the 
time domain of a signal to its corresponding frequency domain. The periodic 
function x(t) with period T can be represented by a Fourier series of the form 

x(t) = — - + £ [a n cos(r»u/ 0 f) + 6„sin(nw o 0] (1) 

2 n — 1 

where ojq is the fundamental frequency, equal to ~/T. By Euler’s Identity, 

e> 9 — cos 9 + jsin 9 , 
where j — —1 , Equation (1) can be written as 


( 2 ) 


Since 


and 


2 “ E «n« ;n “ 0t 

1 n — - 1 


oo 


E 

n—1 


—jnuj 0 t 


- '£ a.*-' 


) 


Equation (2) simplifies to 
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Letting c n = j{a n - jb n ) gives 

x{t) = 2 > 

r»— oo 

which is the the Fourier series expressed in exponential form. The derivation 
of the Fourier transform comes from the limiting case of the Fourier series. 
From the definitions 

2 ~ T/ 2 

- -=r f x(t)cos(noj 0 t)dt 
1 r /2 


and 
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2 " T/2 

b n = f x(t)s\n(noj Q t)dt 
T T/2 

we have 

c « = -r / *(0« ;n “° • 

•* r/2 

As the period increases without limit, the frequency moves from being a 
discrete variable to becoming a continous variable, ncj Q — ► w. Also, 

+oo 

e n T —* f x(t)e~ ,wi dt [NIL83]. If the integral limits are confined over AT time 

— OO 

samples for which N independent frequency samples can be calculated, the 
DFT results. Thus, for » * 0, 1, 2, • • • , N — 1 and k — 0, 1, 2, • • • , AT— 1 

Ar _ l 

X(koj 0 ) = 2 *(»Z> * 

n - 0 


For simplicity, the terms T and c% are dropped from the indicies giving the 
co mm on form of the DFT. 

W-, 

X(*) - 2 x(n)e w 
0 


The values 


iV-i , 

- . 

n-0 

—2irjk 

W k - e * Jb - 0, 1, ..., AT— 1 


( 3 ) 


are referred to as twiddle factors. 

Straight-forward evaluation of Equation (3) for all values of k has time 
complexity 0(N 2 ). However, because of inherent redundancies in the 
evaluation of the transform, an 0 (Nlog 2 N) time algorithm can be 
constructed. 


If N is factored as iV* LxM then the 1-dimensional vector can be 
represented as a 2-dimensional array with L columns and M rows [RAG75]. 
To facilitate indexing of this array we define 


where 


n = Ml + m, k — Lr+s 


x(n) — x(l,m ) 


l — 0) ...,£ — 1, 


m — 0 , 1 


X(k)^X(s,r) s * 0, . ,.,L— 1, 


Expressing (3) using this indexing scheme yields 


r = 0, ...,M — 1 
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X(k) - X(s, r ) - m) W^ m + m ^ Lr + *) . (4) 

m-Ql-0 

Simplifying and rearranging terms gives 

X{s , r) = m) + mir + + 

m - 0/-0 

JW-lL-1 . 

* 2 r r ' 

m — 0 / — 0 

= i ^ 1 w imr m ) wMil ( 5 ) 

m — 0 /— 0 

since W w,£,r ** W wir ** 1. Therefore, X(s, r) can be computed as two multiple 
transforms. First, M-transforms of length L are computed. If we let q(s,m ) 
be this result multiplied by W™, 

q (s, m) = W™j:x{l, m) W Msl , 

1-0 

then Equation (5) reduces to 

• ( 6 ) 

m — 0 

Finally, compute the M-point DFT of each row of the q(s, m) matrix using (6). 
Instead of computing X (s, r) directly from Equation (6), it can be factored in 
the same way Equation (3) was factored. If N « N Q Ni'”N n _i, then n such 

factorizations are possible, which yields the Cooley-Tukey Fast Fourier 

Transform (FFT) [COT65]. When N 0 — Nj = • • • * N n _ i = r, the 
algorithm is described as radix- r. As a result of the factorization, the output 
data vector is in bit-reversed order. Specifically, element X[a n _ 1 o n _ 2 * * * oxao] 
must be swapped with element X[o 0 fli * • • a„_ 2 a n-i] where N — 2 n . For 
example if N— 8, element X[l] must be swapped with element X[4] since 
(l) 10 s (001) 2 and (4) 10 = (100) 2 . Similarly, the elements X[3j and X[6] must 
also be swapped. 

Since the FFT algorithm described successively divides the original 
sequence into equal halves, the frequency resolution of the output of both of 
the smaller DFTs is decreased by a factor of two. This corresponds to 
dropping (decimating) alternate outputs of the original DFT and is called a 
decimation-in- frequency (DIF) FFT. By simply reordering the summations in 
(5), a decimation-in-time (DIT) FFT can be derived. 

Each stage in an Appoint, radix-r FFT consists of N jr independent 
butterfly operations. A butterfly operation takes r inputs and the 
corresponding twiddle factor(s) to compute each of the r outputs. The flow 
graphs in Figures 1 and 2 illustrate the DIF and DIT algorithms, respectively, 
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for radix-2 and N=8. As shown, the DIF FFT produces bit-reversed output 
from natural ordered input, while the DIT FFT produces ordered output from 
bit- reversed input. Preference for DIF versus DIT depends on data ordering 
requirements imposed by previous or subsequent computations or I/O activity. 

Many factorization schemes have been implemented as an attempt to 
reduce the total number of complex operations required by the FFT [SWA87]. 
For example, if N is an integer power of four then a radix-4 algorithm can be 
used. This reduces the number of arithmetic operations required as compared 
to the radix-2 algorithm [BER68]. The same holds true for radix-8 and radix- 
16 algorithms. If IV is a power of two but not four, then a mixed radix or 
“2+4” algorithm may be used. The first or last stage is computed using a 
radix-2 algorithm, while all other stages use the radix-4 algorithm. 


Table 1 gives operation counts for the DFT and various radix FFTs. In 
counting the number of real operations, note that one complex multiplication 
requires four real multiplications and two real additions, except for the case 
where the twiddle factor is W° = 1. Similarly, one complex addition requires 
two real additions. The number of complex multiplications can further be 
reduced at the expense of determining those operations involving twiddle 
factors of ±j. 


The two-dimensional DFT is given by 

X{k u k 2 ) * x(n 1 ,n 2 )r } ' 2 * inikl/Nt + n ^ /N ^ 

0 n 2 -0 


( 7 ) 


where N\ and N 2 represent the sizes of the transform in each of the two 
dimensions. The two-dimensional DFT is generally evaluated by either the 
classical row-column method or more recently by a direct method [JMS86]. 

The row-column method computes the two-dimensional FFT as a series of 
one-dimensional FFTs. Given an N l xN 2 array of data points to be 
transformed, N x one-dimensional iVj -point FFTs are computed on the rows of 
the array, followed by N 2 one-dimensional -point FFTs on the columns of 
the partially transformed array. This can be shown by rearranging the terms 
in Equation (7) as 

X(k 1 ,k 2 )^\\~ j2 ^ nikl/Nl)N ^ l x(n 1 ,n 2 )e~ i2,r{n2ki/N ^ . ( 8 ) 

»!— 0 0 

As in the derivation of the 1-dimensional FFT, repeated factorization across 
each dimension of Equation (8) gives the 2-dimensional FFT. Consequently, 
this is an 0(N 2 logN) process on a sequential machine. However, on a parallel 
machine with N x N 2 /4 processors, the time complexity reduces to O(logN). 

Alternatively, when N x = N 2 the direct method can be used. This 
method does not reduce the problem to a set of one-dimensional FFTs, instead 
it computes the 2-D FFT directly. This method is also O(logN) on a parallel 
machine with N 2 /4 processors. 
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Input Output 



Butterfly 

Input Output 



Figure 1: DIF flow graph (N=8) and detail of individual butterfly. 
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Figure 2: DIT flow graph (N=8) and detail of individual butterfly. 
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The multidimensional DFT is a direct extension of the one dimensional 
case. The D-dimensional DFT is defined for N ®* N x iV 2 • • * Nj) by 


X(k u ..., k D ) - ..., 
» 1 n 0 


( 9 ) 


A D-dimensional DFT can be evaluated with a 1-dimensional FFT by properly 
ordering the inputs and selecting the appropriate twiddle factors [ELR82]. 


2.2 The Connection Machine 2 


A large class of parallel computers are SIMD (Single-Instruction stream 
Multiple-Data stream) machines. As shown in Figure 3, an SIMD machine 
consists of a control unit, N processing elements (PEs), an interconnection 
network, and N memory modules. The control unit broadcasts a single 
instruction stream simultaneously to all N PEs. Depending on whether a PE 
is enabled or disabled, it either executes the instruction or waits. Since each 
processor operates in lock step with the control unit, synchronization is 
inherent in the design. The interconnection network provides a means for 
exchanging data among the PEs. 

The CM-2 is an SIMD machine consisting of 64K (K=1024) PEs when 
fully configured [TMA87]. The actual hardware with the floating point option 
consists of 2048 chip-sets where a chip-set includes two processor chips (32 
PEs), 256 Kbytes of memory, a Weitek 32-bit floating point unit, and a data 
transposer used to convert parallel data to serial and vice-versa. Control of 
the Connection Machine is handled by one or more front-end host processors. 
The host machine provides the user with a standard operating system, I/O 
facilities, and debugging tools. Currently, the CM-2 may be used with DEC 
VAX BI bus machines running Ultrix and/or with Symbolics 3600-series Lisp 
machines. 

Programming can be done currently in two high-level languages: *Lisp 
(pronounced star-lisp) and C* (pronounced cee-star). These languages are 
supersets of Common Lisp and C, respectively. Additional language 
constructs provide the means for expressing parallelism in algorithms for the 
CM-2. Paris (PARallel Instruction Set) is the assembly language of the CM-2. 
The Paris calls are made from within a high-level language such as C. These 
low-level instructions remove the degree of abstraction found in *Lisp and C* 
enabling the user to explicitly control the operation of the machine. 

Physically, the 64K processors are divided into four groups of 16K PEs 
(see Figure 4). Each group is controlled by a microsequencer that receives 

macroinstructions from the host machine, then decodes and broadcasts the 

> 

microinstructions to the individual PEs. These sequencers can operate 
independently to provide four users simultaneous access to one-fourth of the 
CM-2. 
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Figure 4: CM-2 organization. 
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The individual PEs include 3-input, bit-serial ALUs. This simplicity 
allows sixteen PEs to be implemented on a single chip. A 12-dimensional 
hypercube network handles the communication between PEs on different 
chips. Logically, PE 16* through PE 16* +15, where 0 < * < 4096 are on the 
same chip. Each physical PE is allotted 8 KBytes of bit-addressable memory. 
This yields a total of 512 MBytes of memory available on the CM-2. 

The data vault is a mass storage system for the CM-2. Each data vault 
has a storage capacity of 5 GBytes, expandable to 10 GBytes. The graphics 
display consists of a framebuffer module and a high-resolution 19-inch color 
monitor. The framebuffer contains a large video memory and 1/ O circuitry to 
support graphical output. 

Despite the seemingly large number of processors in the CM-2, many 
applications could benefit from additional processors. For this reason, the 
CM-2 software was designed to support virtual processors. The CM-2 uses 
time-slicing to simulate machine configurations of greater than the current 
number of available physical processors. For example, an algorithm that 
operates on 1024x1024 pixel images may be run on a 64K machine, allocating 
one pixel per processor, by using a virtual processor ratio of 16:1. In this case, 
each physical processor does the work of 16 virtual processors. 

The current configuration of the CM-2 at NAS consists of 32K processors, 
floating-point hardware, three Symbolics front-ends, one VAX-8000 Series 
front-end, and Version 5.0/3 of the system software. Since the 32K processors 
are arranged as four sections of 8K processors each, the four sequencers and 
four front-ends enable up to four simultaneous users. 

Paris has undergone major changes between Version 4.3 and Version 
5.0/3. An effort was made to establish a consistency in naming conventions 
among Paris instructions. Specifically, in Version 5.0/3 each instruction has a 
prefix of CM_ and a suffix that specifies the number of field operands and the 
number of length operands, where the field is an integer specifying the starting 
location in local memory of the operand. For example, 
CM_ 3 _abs— 1 — lh{S,16) is a Paris instruction with one field operand, S, and 
one length operand, 16. This instruction directs each enabled processor to 
take the absolute value of the 16-bit signed integer in field 5 and store the 
result back into field S. Similarly, CM. f _abs_2— 1Ij(D,S,2S,8) has two field 
operands (source field 5 and destination field D ), and one length operand. 
Each enabled processor takes the absolute value of the floating point number 
in field S and stores the result in field D. In this case, the length field 
indicates the floating point number has a 23-bit mantissa (with a hidden bit), 
and an 8-bit exponent. A 1-bit sign is implied. This is the IEEE standard 
format for a single precision 32-bit floating point number, however, the user 
may define an alternate format. 

Another significant change in Version 5.0/3 is memory allocation. In 
Version 4.3, the user could directly specify absolute memory locations of the 
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various fields. However, in Version 5.0/?, a memory management system has 
been implemented. The user indicates the length of a field and the system 
returns a field-id used to access the field. This permits system safety checking 
and more system control over memory. 

Among the new instructions, there are many transcendental functions 
that have been implemented. Of specific interest to FFT applications, the sine 
and cosine functions are now available. 

The floating point hardware consists of one Weitek 3132 floating point 
accelerator and one memory interface unit for every 32 processors. The 
memory interface unit is used to convert bit serial data to word parallel data 
and vice-versa. When a Paris floating point add, subtract, or multiply 
instruction is issued, each processor sends its first operand to its assigned 
memory interface chip. Simultaneously, these values are then transferred into 
the accelerator chip while the second operands are loaded into the memory 
interface unit. At this point the second operands are transferred into the 
accelerator and the floating point operation is executed. Finally, all 32 results 
are sent back to the memory interface unit and subsequently returned to the 
appropriate PE memory. The complete cycle requires five stages. However, if 
a virtual processor ratio of N is used, the process is pipelined as to only require 
3 N + 2 stages instead of the expected 5 N stages. 

2.3 The Cray 2 

In contrast to the CM-2, the Cray-2 is a tightly coupled MIMD (Multiple 
Instruction stream, Multiple Data stream) machine. Four high-speed vector 
processors share 256 MWords (4096 MBytes) of common memory. The 
operating system functions and I/O are handled by a dedicated foreground 
processor. Each of the four background processors include vector hardware 
pipelines. These pipelines reduce the overhead required for vector operations, 
where a vector operation is an arithmetic function applied to an entire array 
of data. 

In addition to pipelined vector operations, the Cray-2 has the ability of 
exploiting multiprocessor parallelism at various levels including loops, 
subroutines, intraprogram tasks, and independent programs. The system 
software provides a library of Fortran subroutines to synchronize processor 
execution. Because of this incurred overhead, multitasking on the Cray-2 does 
not provide optimal speedup. For this reason, most applications are run on a 
single processor, allowing other applications to use the remaining processors. 
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3. Algorithm Design 

There are many different, yet functionally equivalent, ways to implement 
an FFT algorithm. The current best implementations for vector 
supercomputers [BAI87, BAI88, SWA87] carefully tailor the calculation to 
have memory access patterns and vector operation lengths that minimize 
overhead and avoid idle time when possible. Algorithm designers also seek to 
reduce the number of operations required, by careful choice of FFT radix. 
When the utmost in performance is to be achieved, the precise details of a 
particular implementation are strongly dependent on the target computer. 

Design for best performance in a parallel computing environment involves 
not only the machine-specific algorithm implementation tuning seen for 
sequential computers, but a change in design goals. For example, the design 
goal of reducing total operation count, which may lead to better sequential 
algorithms, is the wrong focus for use when the algorithm is to run on an 
SIMD computer. Given N PEs, N operations may take no more time than one 
operation, provided the N operations involve no sequential data dependencies. 
For SIMD computing, executed instruction count reduction serves the 
analogous purpose as operation count reduction does for sequential machines. 
There are many tradeoffs available to the parallel algorithm designer. 

3.1 Tradeoffs 

One important implementation decision concerns how to efficiently 
distribute the data over the available processors. In a distributed memory 
machine, such as the CM-2, a balance should be achieved between 
computation time and communication time in an attempt to best utilize the 
architecture to achieve high performance. A low data-point to processor ratio 
would be favored given a fast interconnection network and limited 
computational power per processor. If the number of available processors is 
large enough to allow this low ratio, significant speedup may be realized 
through the high degree of parallelism. This scenario characterizes the CM-2 
without the floating-point hardware installed. Data sets with as many as 
65536 points can be processed with a 1:1 data-point to processor ratio on a 
fully configured CM-2. However, with the optional floating-point hardware 
installed, it may be advantageous to increase the number of data points per 
processor. 

A second important implementation decision is generation of twiddle 
factors. Three possible choices have been considered. The choice of which 
approach to take depends upon available memory, I/O bandwidth, and 
processing capabilities. 

(1) Table look-up. Initially, available processors compute the N f 2 twiddle 
factors and store these values in a table. For subsequent FFTs, the 
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twiddle factors would be available for use without recalculation. In a 
shared memory machine with fast access rates, this may be the best 
choice. However, in distributed memory systems multiple copies of this 
table have to be stored and therefore require large amounts of memory 
(see Table 2). 

(2) Direct calculation. The twiddle factors can be computed directly 
from the equation W k — cos[(27r/iV)fc] — js\n[(2Tr/ N)k]. Calculation of 
the trigonometric functions on each iteration may be time consuming. 
However, for large values of N, there may not be enough additional 
memory to store a table of twiddle factors. 

(3) Permutation. This method depends on a fast interconnection network. 

Initially, the twiddle factors are distributed among the processors 
according to the calculations required in the first stage. On subsequent 
stages, half of the twiddle factors are permuted each to two other 
processors. For example, using N PEs to compute a 1-dimensional N- 
point DIF FFT, the following permutation can be used. After iteration 
i, where - 1 decrements by 1 from log 2 N—l to 0, only those PEs with 
address A »■ (a n _ 1 a n _ 2 ...o 1 a 0 ) matching the tag 

(a„_x a n _ 2 ...a t >ila,-_i ...c^O) are enabled. Each of these PEs sends its 
twiddle factor to PE A /2 and to PE (A +N)/2. 

During the evolution of FFT algorithms on conventional uniprocessor 
architectures particular attention was devoted to minimizing the number of 
required multiplications and additions. This lead to the use of higher radix 
algorithms. 


The butterfly for the radix-2 DIF FFT is shown in Figure 5. An N-point 
1-dimensional FFT requires N/2 butterfly operations at each of the log 2 N 
stages. The inputs, Ac and Be, are complex numbers denoted by the 
subscript C. The real number representation of Ac =* Ar + jAi where Ar i s 
the real part and Aj is the imaginary part of the complex number (j = v — 1 ). 
Using temporary registers, Tl and T2, Figure 6 lists the actual arithmetic 
operations required to perform a radix-2 butterfly. A count of the arithmetic 
operations yields 4 real multiplications and 6 real additions. Since there are 


N 

— log 2 N butterfly operations, the exact count of operations for a complex- 


valued FFT can be computed and is shown in Table 1. 


A similar analysis can be applied to radix-4 algorithms. The radix-4 
butterfly is shown in Figure 7. In this case, four temporary registers are 
needed. Each radix-4 butterfly requires 12 real multiplications and 22 real 
additions. 


In general, a radix-r algorithm for a 1-dimensional FFT requires N/r 
butterfly operations at each of the log r N stages. Each of the N/r butterfly 
operations at stage i, where 0 < i < log r N, are independent and therefore can 
be executed simultaneously. On an SIMD machine with N/2 processors, if 
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Table 2: Twiddle factor storage requirements. 



Memory (Bytes) 

Method 

Single 

Precision 

Double 

Precision 

Table 

4Nlog 2 N—4N 

SNlog 2 N-SN 

Direct 

Of 

Of 

Permutation 

4N 

SN 


f No storage space required, assuming scratch area available. 
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B c 



Ac + B c 
(Ac - B C )W 


Ac * Ar + j Ai 
Be ■* Br + jBi 
W c “W a + jWj 


Figure 5: DIF complex, radix-2 butterfly. 


M O 
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T1 

«- 

Ar — Br 

T2 

- 

Ai — Bi 

Ar 

- 

Ar + Br 

Ai 

- 

Ai +Bi 

Br 

- 

TlxW| 

Bi 

- 

T2xWf 

Br 

- 

Br — Bi 

Bi 

- 

T2 x 

T2 

- 

Tl x Wjf 

Bi 

- 

T2 +Bi 


Figure 6: Radix-2 butterfly real-valued arithmetic operations 
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A<. +B C +C c +D C 


(A c -jB c -C c +;D,.)Wj 1 


(A,-B c +C c -D c )W“ 


(A,+jB c -C c -jDc)W“ 


Figure 7: DIF complex, radix-4 butterfly. 


TR 88.18 


- 20 - 


December 1988 


each processor perforins one radix-2 butterfly operation then the total number 
of arithmetic operations issued equals 101og 2 N. Similarly, a parallel radix-4 
algorithm using N/4 processors would require 34log4N or 17log 2 N arithmetic 
operations. Consequently, when N/2 processors are available it would be 
advantageous to use the radix-2 algorithm since 101og 2 N < 171og 2 N. 

3.2 FFTs on Conventional Vector Supercomputers 

As computer technology advances, algorithms intended to have very high 
performance must continually be analyzed to determine what characteristics 
are currently important. Previously, improving FFT algorithms meant 
reducing the number of multiplications required and performing the transform 
in-plaee to reduce storage requirements. However, with a clock cycle of 4.1 
nanoseconds and 256 MWords (2048 MBytes) of memory, the Cray-2 
illustrates that algorithm design priorities must be dynamic. Issues involved 
in algorithm design for the Cray-2 include minimizing memory contention 
through the use of unit-stride memory accesses and operating on long vectors 
[BAI87, SWA87], Since the memory is interleaved into 128 banks, the 
processor is able to fetch a word (64-bits) from memory on every clock cycle 
provided there are no conflicts with other processors and each bank is accessed 
only once every 128 clock cycles. The worst case occurs when the data has a 
stride that is a multiple of 128, which means all the data is on the same bank. 

3.3 FFTs on the CM-2 

A radix-2 algorithm was implemented in C/Paris (see CM-Jffc.c in 
Appendix) that computes the 1-dimensional transform of a set of N complex 
data points. Due to available memory restrictions and the lack of 
transcendental functions in Version 4.3 of the CM-2 software, permutation 
was chosen as the method for twiddle factor generation (see Table 2). The 
twiddle factors are computed by the front-end processor and then loaded into 
the appropriate PEs. This program requires N processors and 

6P + 2log 2 N + 1 bits of memory per processor, where P is the number of bits 
used to represent a floating point number. Figure 8 and Table 3 describe the 
memory map of each processor. 

From the formula in Table 2, the largest one-dimensional, double 
precision FFT that can be done within the bounds of the CM-2 memory space 
(512 MBytes) is 8M points. The amount of memory required for each virtual 
processor is 431 bits. Therefore, with 65536 bits of memory available per 
processor, the largest possible virtual processor ratio (must be an integer 
power of 2) is 128:1. Using a fully configured CM-2 with 65536 processors, a 
total of 128x65536 or 8M (M= 1,048,576) virtual processors are available. 
This size restriction only applies to this specific program. Larger FFTs may 
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Figure 8: Memory map of each. PE for program CM— fft.c 
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Table 3: PE memory space requirements. 


Field 

Contents 

Bits* 

A 

Self Address 

log 2 N 

B 

Real Data Value 

P 

C 

Complex Data Value 

P 

D 

Real Data Value 

P 

E 

Complex Data Value 

P 

F 

Real W Value 

P 

G 

Complex W Value 

P 

H 

Temporary Flag 

1 

I 

Destination Address 

log 2 N 

Gap 

Stack growth space 

ot 

Stack 

Not used 

of 


TOTAL 

6P+2log 2 N+l 


* Single Precision P=32, Double Precision P=64 
f No space required; algorithm does not use stack 
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be performed by using more efficient storage techniques. For example, a 
program that allocates 256 points per processor with a virtual processor ratio 
of 1:1 can transform a 16M point array. This implementation requires only 
512 KBytes of memory for twiddle factors, a savings of more than 63 MBytes. 
This savings is at the expense of throughput since each twiddle factor must be 
computed before its use. 

The C/Paris program performs the 1-D FFT with 2Nlog 2 N — 2N real 
multiplications and 3Nlog 2 N — N real additions. Since these operations are 
done in parallel on N processors, the actual serial operation count is only 
4log 2 N — 4 real multiplications and 6log 2 N — 2 real additions. Parallel 
algorithm design on SIMD machines requires that the number of serial 
operations be kept to a minimum, not so much the total number of 
operations. For example, in serial machines the total number of complex 
multiplications required in the first stage of a DIF FFT algorithm may be 
reduced from N/2 to N/2 — 4 by taking advantage of twiddle factors of ±1 
and ±j. However, on an SIMD machine such as the CM-2, all the complex 
multiplications required in the first stage are executed simultaneously if N/2 
processors are used. Therefore, it would be more efficient to actually perform 
the complex multiplications of ±1 and ±j rather than take additional time to 
disable those processors and treat them as special cases. This holds as long as 
at least one nontrivial complex multiplication must be performed. 

Figure 9 illustrates the total number of floating point instructions 
executed on the CM-2 for radix-2 and radix-4 algorithms. It is assumed that 
the radix-2 algorithm uses N/2 PEs for an N-point FFT, while the radix-4 
algorithm uses N/4 PEs. Since the CM-2 has a maximum of 64K physical 
processors, FFTs requiring more PEs use virtual processors. 

After the installation of Version 5.0/? of the system software, two 
additional C/Paris programs were written, CM-fft_t_2.c and CM_fft_t_l.c. 
Both programs perform complex-valued multidimensional FFTs. However, 
CM_fft_t_ l.c requires N PEs for an N-point FFT while CM— fft_t_2.c 
requires only N/2 PEs. For the timing measurements taken, the twiddle 
factors, W k , are assumed precomputed and available in local memory. These 
values may also be computed directly in-line as shown in the variant programs 
CM_fft_d_2.c and CM_fft_d_ l.c in the Appendix. 

In the case of CM_fft_t_l.c, only half of the processors are active at any 
given time. The butterfly operation is divided between two PEs. Specifically, 

N 

at stage i, where i decrements from log 2 N— 1 to 0, the — PEs that have an 

A 

address that matches the bit mask X IogaN-l ~ 1 0X i are enabled first. The 

N 

symbol X denotes a DON’T CARE in the mask. The other — PEs remain 

N 

idle while the enabled PEs compute x(P) + x(P + — where P is the address 
of the active PE. Subsequently, the context flag in each PE is inverted thereby 
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Figure 9: 


Floating point instruction count for radix-2 and radix-4 FFTs for 
instructions that are executed on the CM-2. 
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enabling those PEs that were previously idle and vice-versa. Then the 

N k 

remaining butterfly operation of (x(P — — ) — x(P))W K is performed. 

A 


After a butterfly is completed,, the data is realigned for the next stage of 
the FFT. This requires PE P exchange its newly computed data with PE 
P0i. Since the data are complex values, each stage requires two transfers. 


After reviewing CM_JFt_t_l.c, it should be evident that the PE 
utilization is only about 50%. Therefore, it appears possible to construct a 
program that requires N/2 PEs and does not suffer any degradation in 
performance. In fact, this is exactly what was done in CMJPLt_2.c. With 
this modification, we are now able to perform an FFT twice as large as 
previously possible using CM— fft_t_ l.c. Unfortunately, for problem sizes less 


N 

than this upper bound, the — program suffers from increased overhead 


because of the problem of aligning the data between stages. Specifically, 


N 

2 


PEs need to transfer their lower butterfly operation result, while the 
remaining PEs need to transfer their upper butterfly result. These two results 
must be sent to two different memory locations in the receiving PEs. In order 
to perform this permutation in one (actually two for complex numbers) 
interprocessor exchanges, the data must be moved to a common source 
location. This problem can be alleviated with the implementation of the 
CM_store_lL Paris instruction. This instruction allows use of indirect 
addressing techniques. Currently, extra data shuffling must be done, resulting 
in slightly lower performance figures for CMJFt_t_2 . c. 



TR 88.18 


- 26 - 


December 1988 


4. Performance 

One of the fundamental problems researchers encounter with parallel 
machines is the difficulty in adequately comparing the performance of 
machines of different architectures [DEA86, MYA88], Indeed, this is the case 
for the CM-2 and Cray-2. An effort was made to keep many of the variables 
constant. For example, all programs compared compute the complex-valued 
FFT. Whenever possible, the CM-2 programs operate on double precision 
(64-bit) floating point numbers to maintain consistency with the Cray-2 
results. However, the floating point hardware installed in the CM-2 only has 
32-bit capability. This should be kept in mind when evaluating these results. 

The two and three dimensional FFT results are expressed in MFLOPS 
(.Millions of Floating Point Operations per Second). Unfortunately, this unit 
of measure may lead to different interpretations of the results. Since it has 
been shown that a comp lex- valued radix-2 FFT can be performed using only 
5Nlog 2 N floating point operations, this number will serve as the basis for 
comparisons. That is, no matter how many floating point operations a specific 
complex- valued FFT program actually executes, the MFLOPS rate is 
computed using 

5Nlog 2 N 

execution time in microseconds 

The MFLOPS rate may be misinterpreted if this guideline is not followed. 
Consider a program that uses more than the required 5Nlog 2 N floating point 
operations. While this program might require more actual execution time 
than a radix-2 version, its MFLOPS rate could potentially be higher, giving a 
false impression of better performance. By following the imposed guideline, 
both programs would be compared based on the given formula, thereby 
producing a higher MFLOPS rate for the radix-2 program as expected. 

4.1 Measurement Techniques on the CM-2 

Few tools have been developed to allow the user of the CM-2 to 
effectively and accurately collect performance information. A tool that can be 
used is gprof, a Unix utility that produces an execution profile of C programs. 
The profile data is taken from a file that is created by programs compiled with 
the -pg option. That option also links in versions of the library routines that 
have been compiled to generate profiling information. 

First, a profile gives the total execution times and call counts for each of 
the functions in the program, sorted by decreasing time. A second listing 
shows the functions sorted according to execution time, including the 
execution time of their descendents. Associated with each function entry are 
its children, and how their times are propagated to this function. An 
advantage of gprof is that CM-2 programs do not have to be modified in 



TR 88.18 


-27- 


December 1988 


order to gather performance information. However, the results of gprof 
should only be used to compare relative performance, since gprof only profiles 
host activities and not those of the CM-2. For example, this utility might be 
used when trying to speed up a program; in this case, gprof can establish a 
basis for performance measurements. 

The CM-2 software includes a timing facility for reporting both real time 
and CM-2 active time. This facility consists of three Paris instructions: 

CM_start— timer(l) - Begins accumulating timing information 
CM_stop_timer(0) - Stops accumulating timing information 
CM_reset_timer() - Clears timing information 

These timing instructions can be inserted around any block of code in either a 
C/Paris program or C* program. If CM_stop_timer(l) is used, the timing 
information is automatically printed out. However, in Version 4.3, this 
function gives times to only two significant figures. Therefore, the programmer 
can use CM_stop_timer(0) which returns a pointer of type CM-timevaL-t 
that points to where the full precision timing data is stored. Use of these 
functions is shown in Figure 10. 

Initially CM_start_timer clears out all the sequencer queues and makes 
sure all the previous Paris instructions have finished, then it reads the system 
time from the front end and resets the idle timer in the CM-2. The idle timer 
increments a counter whenever the microsequencer is waiting for an 
instruction from the front end. After initialization, the program executes 
normally until CM—stop— timer is encountered. At this point, the system 
time and idle timer are read. The elapsed front end time, CM-2 time, and 
CM-2 utilization can then be calculated. 

For accurate results, the duration of time being measured must be much 
greater than the initialization latencies. Typically, program segments taking 
on the order of 10 seconds produce consistent results. Iterative loops can be 
used to measure the time of short segments or even single instructions. 

4.2 Experimental Results 

High performance FFT algorithm design requires a knowledge of the time 
to perform various elementary instructions and data transmissions. Table 4 
lists execution times of various Paris instructions performed on 8192 PEs 
(based on 100,000 iterations of each instruction). 

Figure 11 shows the time required for various communication tasks in the 
CM-2, and can be understood as follows. The pipeline through the router is 
29 bits (12 cube wires x 2 bits per dimension + 1 for overhead -1- 4 for 
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^include <stdio.h> 

^include <cm/cm.h> 

main () 

{ 

CM_timevaL_t *cmtvp; 

CM_init(); 

r 

* Program Segment 

7 

CM_start_timer(l); 

/* 

* Program Segment to be timed 

7 

cmtvp = CM— stop_timer(0); 

printf("Real time: %g sec; CM time: %g sec; CM utilization: %g%%'\ 
cmtvp->cmtv_real, cmtvp- >cmtv_cm, 
cmtvp- >cmtv_real > 0.0 ? 

100.0 * cmtvp- >cmtv_cm / cmtvp- >cmtv_real : 0.0); 
CM_reset_fcimer(); 

r 

* Program Segment 

7 

} 


Figure 10: Use of CM— start-timer, CM— stop— timer, and CM— reset— timer. 
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Table 4: CM-2 Version 4.3 Paris instruction timings. 


Operation 

Execution 1 

time (fJsec) 

32-bit 

64-bit 

CM—f— multiply f 

532.8 

1704.7 

CM_f_divide f 

1224.7 

4893.6 

CM_f_add f 

555.1 

1042.9 

CM_f_subtract f 

557.1 

1045.2 

CM_multiply2 ft 

1117.9 

3739.9 

CM_add2 ft 

32.2 

45.2 

CM_move 

32.5 

43.1 

CM_move_constant 

51.8 

105.7 

CM-iszero 

27.1 

33.3 

CM_send (Pi-^Pj+io) 

1353.6 

2137.9 

CM_send (P*— P i+1 ) 

365,4 

543.7 

Cubej (i<4) 

366.3 

545.3 

Cube, (4<i<13) 

1165.1 

1825.4 


f floating point operation, no floating point hardware 
ft integer operation 
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Time 

(//sec) 



Figure 11: CM-2 communication timings. 
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combining functions), therefore, the natural message length is 29 bits. Rather 
than collapse each pipeline down as the message length decreases, the router 
microcode copies the message into a temporary location if it is less than 29 
bits long, pads the message to 29 bits, then sends the message. This copying 
of shorter messages accounts for the slight increase in delivery times with 
message length increasing from one bit and then sudden decrease at a message 
length of 29 bits. 

Figure 12 illustrates an inherent difference that exists between parallel 
and vector architectures, specifically the CM-2 and Cray-2. On this semilog 
graph of FFT performance figures, the time required on a SIMD machine, the 
CM-2, follows a straight line. However, the time required by the Cray-2 
[BAI88] follows an exponential curve. The reason for this difference is that as 
the size of the 1-dimensional FFT increases, the number of processors in the 
CM-2 is also allowed to increase. At the same time, the amount of hardware 
represented by the Cray-2 remains constant. 

Table 5 includes times realized from executing various size one- 
dimensional FFTs on the CM-2 Version 4.3 with 8K physical processors. 
Times do not include down-loading the initial data or unscrambling the results 
(bit-reversal). 

Additional execution timing measurements were taken based on the two 
Version 5.0/? C/Paris programs, 2.c and CM_fft_t_l.c. Table 6 

contains these performance measurements for both the Cray-2 and CM-2. As 
expected, the performance figures for the Cray-2 reflect the advantage of using 
longer vector length operations. For example, a 4096 point FFT arranged as a 
128x32 two-dimensional transform results in a higher MFLOPS rate than the 
same size FFT arranged as 16x16x16 in three dimensions. However, this is 
not the case for the CM-2. The performance figures for the CM-2 depend on 
the size of the FFT rather than the dimensions of the transform. 

The results in Table 6 demonstrate that the CM-2 is most effective on 
large FFTs, that is, when all of the available processors are utilized. 
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Table 5: 


1-D comp lex- valued FFT timings using CM-JFt.c in the Appendix, 
running under Version 4.3. 


Size 

Log(Size) 

Virtual 

Processor 

Ratio 

Execution time (ms) 

Double 

Precision 

Single 

Precision 

131072 

17 

16:1 

7229.0 

3806.0 

65536 

16 

8:1 

3431.0 

1807.0 

32768 

15 

4:1 

1345.4 

704.1 

16384 

14 

2:1 

595.3 

306.3 

8182 

13 

1:1 

275.2 

140.9 

4096 

12 

1:1 

249.4 

127.4 

2048 

11 

1:1 

223.1 

113.5 

1024 

10 

1:1 

198.0 

100.4 

512 

9 

1:1 

171.0 

86.2 
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Table 6: Multidimensional comp lex- valued FFT MFLOPS 


FFT 

Size 


Cray-2 


1 CPU 


CM-2 (Version 5.0/?) 


CM—fFfc— t_2.c 
N/2 PEs 


N PEs 


32x32 

32x64 

32x128 

32x256 

32x512 

32x1024 


118.58 

172.37 

168.79 

181.19 

189.14 

194.01 


1.95 

3.77 

7.30 

14.45 

28.50 

55.25 


2.04 

4.00 

7.90 

15.66 

30.51 

61.32 


64x32 

64x64 

64x128 

64x256 

64x512 

64x1024 


167.26 

220.58 

236.36 

255.82 

230.15 

249.32 


3.77 

7.40 

14.43 

28.02 

55.55 

111.17 


4.00 

7.85 

15.60 

30.70 

60.97 

N/A 


128x32 

128x64 

128x128 

128x256 

128x512 


177.66 

231.82 

243.00 

244.98 

243.84 


7.30 

14.43 

28.27 

54.72 

111.17 


7.90 

15.60 

30.84 

59.26 

N/A 


256x32 

*256x64 

256x128 

256x256 


194.82 

245.24 

246.94 

242.02 


14.45 

28.02 

54.72 

111.17 


15.66 

30.70 

59.26 

N/A 


512x32 

512x64 

512x128 


189.76 

242.54 

248.78 


28.50 

55.55 

111.17 


30.51 

60.97 

N/A 


1024x32 

1024x64 


195.19 

242.11 


55.25 

111.17 


61.32 

N/A 


16x16x16 

32x32x32 


45.26 

78.84 


7.40 

56.04 


8.03 

61.47 
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5. Conclusion 

As the computational demands placed on scientific computers continues 
to escalate, massively parallel computers, such as the CM-2, will play a vital 
role in meeting this demand. The various algorithm design tradeoffs presented 
and analyzed in this report are intended to provide a better understanding of 
those characteristics significant to parallel execution. 

Because of the highly parallel nature of FFT algorithms, it is 
advantageous to utilize the maxim um number of available processors by 
evenly distributing the data among them. When N / 2 PEs are available the 
radix-2 algorithm yields the best results. However, additional work is being 
done to analyze what effect increasing the virtual processor ratio will have on 
performance. 

In applications that require computation of a large number of successive 
FFTs, precalculation of the twiddle factors will result in a substantial savings 
in overall execution time. However, as the size of the FFT grows, the memory 
requirements of this table look-up may not be feasible. In this case, the 
permutation scheme outlined helps to alleviate this problem. For applications 
requiring only a limited number of FFTs, the direct calculation scheme 
becomes the logical choice. 

The C/Paris programs written use the general hypercube router of the 
CM-2 for interprocessor communications. However, since the interconnection 
patterns required for the FFT use only nearest-neighbor communication along 
the hypercube, the full functionality of the general router creates unnecessary 
overhead. For this reason, an investigation is also being made into the 
possible use of more primitive routing instructions. 
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APPENDIX: 
C/Paris Programs 
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CM_ift.c 


- Ray Kamin - June 2, 1988 
Paris Version 4.3 


This program performs a 1-D DIF FFT on NP complex 
points using NP processors. The data is loaded in 
by the front end processor in normal order and read 
back in bit-reversed order. The twiddle factors 
are computed by the front-end and down loaded into 
the appropriate PEs of the CM-2. The twiddle 
factors are permuted for subsequent iterations. 


Use <cc CMJFt.c -lparis -lm -o fft> to compile 


* Note: When changing NP, you must also change K to 

* its corresponding value (i.e. NP=2 A K). 

7 

#include <math.h> 

#include <stdio.h> 

#include <cm/cm.h> 


#define PI 3.14159265358979 

#define NP 2048 /* Number of processors * / 

#define K 11 /* 2**K processors */ 

#define BASE 0 /* Base address of memory in PE */ 

#define MAN 52 /* Length of mantissa S=23 D=52 

#define EXP 11 /* Length of exponent S=8 D=ll * 


v 

/ 


main () 

{ 

double arg, real[NP], complex[NP], omega^r[NP/2], omega_c[NP/2]; 
int reverse, reg[l0], wdsise, i, k; 

CMJnitQ; 

k=K; 

arg=2*PI/NP; /* Used to compute twiddle factors */ 

/* Establish some meaningless data */ 

for (i=0; i<NP; i-H-) 

{ 

real[i]=cos((double) arg*i); 
complex[i]=sih((double) arg*i); 

} 
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/* Set up the complex omega values */ 

for (i=0; i<NP/2; i++) 

{ 

omega— r[i]=cos((double) (arg*i))j 
omega-c[i]=0-(sin((double) (arg*i))); 

} 

/* Set up the static memory in each PE 
* 

* reg[0] - CM-cube-address-length-bit self-address 

* reg[l] - Real part of first data element 

* reg[2] - Complex part of first data element 

* reg[3] - Real part of second data element 

* reg[4] - Complex part of second data element 

* reg[5] - Real part of twiddle factor 

* reg[6] - Complex part of twiddle factor 

* reg(7] - 1-bit context flag for PEs < NP 

* reg[8] - Temporary address register 

7 

reg[0]=*BASE; 

wdaiie=MAN+EXP+l; 

for (i*l; i<=7; i-Hf) 

reg[i]=reg[Oj-fCM-cube-address_length-f-(i-l)*wdsiie; 
reg[8]— reg[7]+l; 

CM-set_jtack_limit(reg[8]+CM_.cube_addreaa_length); 
CM_3et-stack_upper-bound(CM_user-memory_address— limit); 

CM— reset— stack— pointer; 

CM-move-constant-always(CM-context-flag,l,l); 
CM-my_cube_address(reg [0] ); 

CM-U-le_constant(reg[0],NP-l,CM_cube_address-length); 

CM-move(reg[7],CM-test-flag,l); 

CM— move ( CM-con text_fl ag, CM-test— flag , 1); 

/* Load PEs with data in normal order */ 

for (i=0; i<NP/2; i++) 

{ 

CM-f-write-to-processor(i,reg[l] ,real[i] , MAN, EXP); 
CM-fLwrite_to-processor(i,reg[2] ,complex[i] , MAN, EXP); 
CM_f_write_to-processor (i+NP / 2 ,reg [l] ,real [i+NP / 2] , MAN, EXP ) ; 
CM_f_write_to_processor(i-hNP /2,reg[2] ,complex[i+NP/2] , MAN, EXP); 
CM-f-write-to-processor(i+NP /2,reg[5],omega— r [i] , MAN, EXP); 
CM-f_write_to_processor(i-hNP/2,reg[6] , omega— c[i], MAN, EXP); 

' } 


fft(k,reg); /* Execute the FFT */ 

/* Read data from all PEs in bit-reversed order */ 
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for (i=0; i<NP; i++) 

{ 

reverse=ibitr(i,k); 

real [reverse] = CM_f_read_from_processor (i,reg [l] ,MAN ,EXP ); 
complex [reverse] =CM_f_read_from_processor(i,reg [2 j , MAN, EXP); 

} 

if (NP<=16) 

{ 

for (1=0; i<NP; i++) 

{ 

printff'PE %d — %g + j%g\n",i,real[i],complex[i]); 

} 

} 

else 

{ 

printff’PE %d = %g + j%g\n",0,real[0],complex[0]); 
printff’PE %d — %g + j%g\n’’,NP-l,real[NP-l],complex[NP-l]); 

} 


j* 1-D DIF FFT Function 
* 

* This function executes the DIF radix-2 CTA on the data in 

* the PEs. The data is NOT unscrambled before returning to 

* the main program. 

* 

*/ 


fft(k,reg) • 
int k, reg[10]; 

( 

int i 9 xp; 

CM_timeval_t *cmtvp; 
CM-start_timer(l); 
*P— 1* 

xp«=(k-l); 


for (i=k-l; i>=0; i— ) 

{ 

/* Exchange data between PEs * / 

/* actually doing CUBE(i-l) */ 
CM_move_always(CM_con text-flag, reg [7 ] ,1 ); 

CM_u_move_constant(reg[8],xp,CM_cube_address_length); 
CM-logxor (reg [8] ,reg [0] ,CM_cube_address_length); 
CM_send(reg[3] ,reg[8] ,reg[l] ,MAN+EXP+1 ,CM_NOLIMIT); 
CM_send(reg[4],reg[8],reg[2],MAN+EXP+l,CM_NOLIMIT); 

xp»==l; 

/* Select PEs for TOP btrfly*/ 
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CM-issero(reg [0] -hi, 1 ); 

CM— move(CM_context— flag, CM-test— flag, 1); 

/* DO X=A+B V 

CM-f-add(reg(l],reg[3],MAN,EXP); 

CM_fladd(reg [2] ,reg[4] , MAN, EXP ); 

/* Select PEs for BOTTOM btrfly*/ 

CM-move_always(CM-con text-flag, reg [7] ,1); 
CM-move(CM-context-flag,reg[0]-hi,l); 

/* DO X=(A-B)* twiddle factor */ 
CM-f-subtract(reg[3],reg[l],MAN,EXP); 

CM-f-sub tr act (reg[4] ,reg[2] , MAN, EXP); 

CM-f-move(reg[l] ,reg[3] , MAN, EXP); 

CM_flmoye(reg[2] ,reg [4] , MAN, EXP ); 

/* Not necessary to do mult. */ 

/* if i=0 since twid. f =1 */ 

if (i>0) 

{ 

CM-flmultiply (reg [3] ,reg [5] , MAN, EXP ); 

CM-flmultiply (reg[4] ,reg[8] , MAN, EXP); 

CM-JLsub tract(reg [3] ,reg [4j , MAN, EXP); 

CM-flmultiply (reg [l] ,reg [8] ,MAN,EXP); 

CM-f-muItipIy (reg [2] ,reg [5] ,MAN,EXP); 

CM-fladd (reg [2] ,reg [l] ,MAN,EXP ); 
CM_flmove(reg[l],reg(3],MAN,EXP); 

{ 

/* Permute twiddle factors */ 

CM_isiero(reg[0] ,1); 

CM-moTe(CM_context-flag,CM_test-flag,l); 

CM-moTe(CM_con text-flag, reg [0] +i,l ); 

CM-moye-constant(reg[3],-l,MAN-hEXP-hl); 

CM-U-moye(reg[8] ,reg[0],CM_cube_addresaJength); 

CM— shift (reg [8] ,reg [3] , CM— cube— addresa-len gt h ,MAN-hEXP -hi ); 
CM-send (reg [3] ,reg [8] ,reg [5] ,MAN+EXP -hi , CM-N OLIMIT ) ; 
CM-send (reg [4] ,reg [8] ,reg [8 j ,MAN+EXP-hl ,CM-NOLIMIT ); 
CM-U— add-constant(reg [8] ,NP /2,CM_cube_address-length); 
CM-send(reg[3],reg(8],reg[5]^tAN-hEXP+l,CM-NOLIMIT); 
CM_send(reg[4] ,reg [8] ,reg [6] ,MAN+EXP-hl ,CM-NOLEMIT); 

CM-move-always(CM-context-flag,reg[7],l); 

CM-f move(reg[5] ,reg[3] , MAN, EXP); 

CM-flmove(reg[6] ,reg[4] , MAN, EXP); 

} 

} 

} 

cmtvp = CM-stop-timer(O); 

printf("Real time: %g sec; CM time: %g sec; CM utilization: %g%%\n'\ 
cmtvp->cmtY-cm, cmtyp*>cmtv-real, 
cmtyp->cmtv-cm > 0.0 ? 

100.0 * cmtyp->cmtv-real / cmtvp->cmtv_cm : 

0.0); 
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} 


/* Bit Reversal Fuction 
* 

* This function takes an input j of length k and returns the 

* bit-reversed value. For example ibitr(5,4)=10 since if 0101 

* is reversed, 1010 results. 

* 

7 

int ibitr(j, k) 
int j,k; 

{ 

int rev, jl, j2, i; 

ji=j; 

rev=0; 

for (i=l; i<=k; i-H*) 

{ 

j2— jl/2; 

rev=rev*2+(jl-2*j2); 

jl=j2; 

} 

return(rev); 

} 



CM_ cube.c 


CM-cube.c - Ray Kamin - June 16, 1988 
Paris Version 4.3 

This program performs cube functions 
on NP processors using the router. 

Use <cc CM_cube.c -lparis -lm -o cube> to compile 

/ 


#include <stdio.h> 
#include <cm/cm.h> 

# define NP 8192 
#define BASS 0 
# define MAN 52 

# define EXP 11 


main () 

{ 

int reg[4], wdsise; 
CM_init(); 


/* Number of proc' 
/* Base address of 
/* Precision; S=2i 
/* Precision; S= 8 


/* Set up the static memory in ea 
* 

* reg[0] - CM-cube-addresa_length-bi 

* reg[l] - CM-cube-address-length-bl 

* reg[2] - Data Register 

* reg[3] - Data Register 

v 

wdsize=MAN+EXP +1 ; 
reg[0]=BASE; 

reg [1 ] = reg [0] +CM_cube_address-length; 
reg[2]=reg[l]+CM_cube_address_ length; 
reg[3]= reg[2]+wdsize; 

CM-set_stack_limit(reg [3] + wdsize+10); 
CM_set-stack— upper— bound (CM_user-memoi 
CM_reset_stack_pointer; 

CM-move_constant-always(CM_context-flag, 
CM-my_cube-address(reg[0] ); 

CM_u_lt_const an t (reg [0] ,NP , CM-cube_addres 
CM— move(CM_context_flag,CM_test— flag, 1); 


cube(reg); 
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} 

/************ Execute Cube sub . Functions ****************/ 

cube(reg) 
int reg[4]; 

{ 

CM-timevaLt *cmtvp; 
int size, xp, t; 

size=0; 

while (size<13) 

{ 

printff'Cube sub 
scanf("9Sd",&size); 

xp=l; 

xp«=size; 

CM_move-constant-always(CM_context-flag,l,l); 

CM_u— move-constant(reg(l],xp,CM_cube-address-length); 
CM-logxor(reg [l] ,reg [0] ,CM-cube_address-lengt h); 

CM— start— timer (1 ); 

for (t=l; t<=100000; t++) 

CM-send(reg[3] f reg[l],reg[2]^dAN+EXP-fl,CMLNOLIMIT); 
cmtvp = CM-stop-timer(O); 
printf( ’Heal time: %% sec; CM time: %% sec; 

CM utilisation: %g%%\n” 9 
cmtyp->cmtv-real, cmtvp->cmtv_cm, 
cmtyp->cmtT*real > 0.0 ? 

100.0 * cmtvp->cmtv_cm / cmtvp->cmtv_real : 

0 . 0 ); 

CM_reset_timer(); 

} 


} 
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CM_send— message .c 


r 

4 CM_send_message.c - Ray Kamin - June 9, 1988 

4 Paris Version 4.3 
* 

4 This program sends various length messages 
4 though the router in an effort to establish 

* a feel for the performance of the network. 

* 

* 

* Use <cc CM-send— message.c -lparis -o send> to compile 

* 

7 

#include <stdio.h> 

#include <cm/cm.h> 

#define BASE 0 /* Base address of memory in PE */ 

#define MSGLEN 128 /* Maximum message length */ 


main () 

{ 

int reg{4], i; 
CM_init(); 


/* Set up the static memory in each PE 
* 

4 reg[0] - CM_cube_address_length-bit self- address 
4 reg[l] - CM_cube_address_length-bit dest. address 
4 reg[2] - <MSGLEN> length register 
• reg[3] - <MSGLEN> length register 

7 

reg[0]=BASE; 

reg[l]=reg[0]-hCM-xube-address_length; 
reg [2] =reg [l] +CM-.cube-address_ length; 
reg[3]==reg[2]+MSGLEN; 

CM-set-stack-limit (reg [3] +MS GLEN); 

CM-set-stack_upper_bound(CM_user_memory_address-limit); 

CM_reset_stack_pointer; 

CM_move_constant_always(CM_context_flag,l>l); 

CM-my-cube_address (reg [0] ); 

/ 4 Permutation : PE i -> PE i+16 */ 

CM-move-constant(reg[l],lS,CM-cube_address-Jength); 
CM_add2 (reg [l] ,reg [0] , CM_cube_address_length); 
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printf( M Sending . . . \n"); 
send(reg); 

} 

^************** transfer message routine ******************/ 

send(reg) 
int reg(4]; 

{ 

int t, size; 

CM_timevaLt *cmtvp; 
size=l; 

while (size !== 0) { 

printf("Enter the message length— >"); 
scanf("9Sd ,# ,&size); 

CM_st art-.timer( 1 ); 

for (t=l; t<= 100000; t-H-) 

ClvLsend(reg [3] ,reg[l] ,reg{2] ,size); 
cmtvp = CM_stop_timer(0); 
printf( ’Heal time: %g sec; CM time: %g sec; 

CM utilization: %g%%\n" 3 
cmtvp->cmtv_real, cxntvp->cmtv_cm, 
cmtTp->cmtv-real > 0.0 ? 

100.0 * cmtTp->cmtT_cxn / cmtvp->cmtv_real : 

0 . 0 ); 

CM_reset_timer(); 

} 
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CM-JftJJ. . C 


CM_fft_t_l.c - R. Kamin - Sept. 1, 1988 
Paris Version 5. OB 

This program performs a multidimensional DIF FFT on N 
complex points using N processors. The data is loaded 
in by the front end processor in normal order and read 
back in bit-reversed order. The twiddle factors 
are assumed precomputed by the CM- 2 processors. 


Use <cc XXX.c -lparis -1m -o fft> to compile 


* Note: Must use 32-bit floating point to use FPU 

v 

^include <math.h> 

#include <stdio.h> 

#include <cm/paris,h> 

#define PI 3.14159265358979323846 
#define MAX-N 32769 

#define MAN 23 /* Length of mantissa; D=52 S=23 */ 

#define EXP 8 /* Length of exponent; D=ll S=8 * / 

#defin« INT-SIZE MAN+EXP+1 /* Length of integer */ 

# define MAX(A, B) ((A) > (B) ? (A) : (B)) 


main () 

{ 

int max-dim, N, reg[lO], wdsiie, i, n, base[l5], site[l5]; 
CM_init(); 

printf("Enter the number of dimensions *>"); 
scanf("%d",&max_dim); 

baae[0]=0; 


for (iss-l; i<=max_dim; i++) 

{ 

printf("Size of dimension %d (in log base 2) ->”,i); 
scanf(”%d”,&sise [i] ); 
base [i] = base [i-1] -l-si*e [i] ; 

} 


n=base[max_dim] ; 
N=(l«n); 
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/* Set up the static memory in each PE 

• 

• reg[0] - Self- address 

• reg[l] - First data register (REAL PART) 

• reg[2] - First data register (COMPLEX PART) 

• reg[3] - Second data register (REAL PART) 

• reg[4] - Second data register (COMPLEX PART) 

• reg(5] - 1-bit context flag for PEs < N 

• reg [6] - Destination address register 

• reg[7] - Twiddle factor register (REAL PART) 

• reg[8] - Twiddle factor register (COMPLEX PART) 

v 

wdsize=MANn-EXP +1 ; 

reg [0] =ss CM-allocat e-stack_field (INT-SIZE) ; 
reg[l]=CM_allocate-jtack-field(wdsize); 
reg[2]=CM-allocate-stack_field(wdsize); 
reg [3] = CM_allocate_st ack-field(wdsize) ; 
reg[4]=CM-allocate^tack— field(wdsize); 
r eg [5] = CM_alloc ate_st ack_field ( 1 ) ; 
reg[6]=CM-allocate-stack-field(INT-SIZE); 
reg(7]=CM-allocateu-stack-field(wdsize); 
r eg[8]=CM_aUocate_atack_field(wdsi»e); 

CM-set-context; /* Activate all PEs */ 

CM_u— move_constant_lL(reg[5],0,l); 

printf( M Setting addresses of %d processors ”,MAX(N,8192)); 

for (i=0; i<MAX(N,8192); i++) 

CM-u-.write-to-processor_lL(i,reg[0] ,i, INT-SIZE); 
printf( M done!\n M ); 

printf("Activating %d processors . . .”,N); 

CM-uJt»constant_lL(reg[0],N,INT-SIZE); /* Select low N PEs */ 

CM-load-context(CM-test-flag); 

CM-U-mo ve-cons t an t- 1 L (reg [5 ] , 1 , 1 ) ; 
printf( M done!\n”); 

/* Load PEs with data in normal order */ 
printf("Loading data . . ."); 

CM-fLmove-constant_lL(reg[l],0.0,MAN,EXP); 
CM-f-move-constant-lL(reg(2],0.0,MAN,EXP); 
CM-f-move-constant-lL(reg[3] ,0.0, MAN, EXP ); 
CM-f-move-constant-lL(reg(4],0.0,MAN,EXP); 

CM-f-move-.constant-.lL (reg (7 ] , 0 .0,MAN ,EXP ); 

CM_f-move-constant_lL (reg [8] ,0.0, MAN, EXP ); 
printf("done!\n"); 

fft(n, reg, base, max-dim); /* Execute the FFT */ 


} 

/*********************** Dip FFT Function ******************************/ 



TR 88.18 


- 51 - 


December 1988 


fft(n, reg, base, dimension) 

int n, reg[10], base[15], dimension; 

{ 

int lfactor, q, t, i, two-i, current-dim, previous-dim; 
double arg; 

CM-timevaLt *cmtvp; 

prinfcf("Enter loop factor ->”); 
scanf( M %d",&lf actor); 
printf(”\n H ); 
q=dimension; 

CM-star t-timer( 1 ) ; 

for (t=0; t<lfactor; t-H-) 

{ 

dimension=q; 


for (i=n-l; i>=0; i— ) 

{ 

if (i<ba8e[dimension]) 
dimension— dimension- 1 ; 
current— dim == base [dimension] ; 
previous_dim==base[dimension+l] ; 

/* Exchange data between PEs */ 

/* actually doing CUBE(i) */ 
two_i=(l«i); 

CM-U_move-aIways_lL(CM_context_flag,reg[5],l); 
CM-logxor-constant_3_lL(reg[8] ,reg [0] ,two_i,INT-SIZE); 
CM_send-lL(reg[3],reg[8],reg[l],MAN+EXP+l); 
CM-send-lL(reg[4] ,reg[6] ,reg[2] ^dAN+EXP-hl); 

/* Select PEs for TOP butterfly */ 

CM-U— eq-zero— 1L (reg [0] H-i, 1 ); 

CM-load-context(CM— test— flag); 

/• DO X=A+B V 

CM-£_add_2_lL(reg[l] ,reg [3] ,MAN,EXP); 

CM-f-ad d— 2— 1L (reg [2] ,reg[4] ,MAN,EXP); 

/* Select PEs for BOTTOM butterfly */ 
CM-u_move-always-lL (CM-con text-flag, reg [5] , 1 ); 
CM-load-test (reg [0] +i); 

CM-loadLcontext(CM-test-flag); 

/♦DOX-(A-B) V 

CM-£_sub tract-2- 1 L (r eg [l] ,reg [3] ,MAN ,EXP ) ; 

CM-fLsub tract-2-lL(reg[2] ,reg[4] , MAN, EXP); 

/* Not necessary to do mult. */ 

/* if i=current_dim since w=l */ 
if (i>current-dim) 

{ 


/* Multiply by twiddle fact. */ 
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CM-JLjnultiply-3-lL(rcg[4] f reg[2],reg[8],MAN,EXP); 

CM-JLmultiply-2_lL (reg [2] ,reg [7 j ,MAN >EXP ) ; 

CM-fLmultip ly-2-lL (reg [7 ] ,reg [l ] ,MAN ,EXP ) ; 
CM-f-multiply^2_lL(reg[l],reg[8] > MAN,EXP); 

CM-fLadd-2_lL (reg [2] ,rcg[l] ^MANyEXP); 

CM-fLsub tract_3_lL (reg [1] ,reg (7 ] f reg [4] ,MAN,EXP ); 

} 

} 

} 

cmtvp=CM_stop_timer(0); 

i=*(l«n); 

printf( H CM_fft_t_l.c Paris Version 5.0B\n"); 

printf("\nTimes computed for a single %d-point %d-dimension complex FFT\n",i,q); 
printf("Times based on a loop factor of %d and using %d PEs\n",lfactor,i); 
printf( M Twiddle factors are precomputed and looked-up.\n"); 
printf("Floating point hardware is ACTIVEIXn"); 

printf( w — \n”); 

printf("Real time: %g sec; CM time: %g sec; CM utilisation %g%%\n\ 
cmtvp->cmtv-jeal/lfactor, cmtvp->cmtv_cm/lfactor, 
cmtvp->cmty-real > 0.0 ? 

100.0 • cmtvp*>cmtv-cm / cmtvp->cmtv_real:0.0); 
printf(”Using a 5NlogN counter, %g MFLOPS\n' , ,.000005*i*n/cmtvp->cmtv_cm*lf actor); 

CM-jeset-.timerO; 

} 
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r 

* CMJfL-t-2.c - R. Kamin - October 22, 1988 

* Paris Version 5.0B 

* 

* This program performs a multidimensional DIF FFT on 

* N complex points using N/2 processors. The data is 

* loaded in by the front end processor in normal order 

* and read back in bit-reversed order. The twiddle 

* factors are assumed precomputed by the CM-2 PEs. 


Use <cc XXX.c -lparis -2m -o fft> to compile 


• Note: Must use 32-bit format to take advantage of FPU 

v 

#include <math.h> 

^include <stdio.h> 

#include <cm/paris.h> 

# define PI 3.14159265358979323846 
#define MAX-N 32789 

#define MAN 23 /* Length of mantissa; D=52 S=23 */ 

#define EXP 8 /* Length of exponent; D=ll S— 8 */ 

#define INT-SIZE MAN+EXP+1 /* Length of integer */ 

# define MAX(A, B) ((A) > (B) ? (A) : (B)) 


main () 

{ 

int max_dim, N, reg[17], wdsize, i, n, base[15], sise[15]; 
CM_init(); 

printff*Enter the number of dimensions 
scanf(”%d M ,&max_dim); 

base[0]=0; 


for (i=l; i<=max_dim; i++) 

{ 

printf(”Size of dimension %d (in log base 2) ->”,i); 

scanf("%d' , ,&sise[ij); 

base[i]=base[i-l]+si*e(i]; 

} 


n=base[max_dim] ; 
N=(l«n); 
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I* Set up the static memory in each PE 

* reg[0] - SR = Self-address Register 

* reg[l&2] - Rl — Complex Register #1 

* reg[3&4] - R2 = Complex Register #2 

* reg[5&6] - Tl = Temporary Complex Register #1 

* reg[7&8] - T2 = Temporary Complex Register #2 

* reg[9&10] - W = Complex Twiddle Factor Register 

* reg[ll&12] - DTRl =*= Complex Data Transfer Register #1 

* reg[13&14] - DTR2 = Complex Data Transfer Register #2 

* reg[l5] - DR Destination address Register 

* reg[16] - CFS — Context Flag storage register 

v 


wdsi*e=MAN+EXP+l; /* Floating point number length */ 

reg [0] =sCM_allocate-stack_lield(INT_SIZE); 
for (i=l; i<15; i++) 
reg [i] = CM— allocate_st ack— field (wdsiie); 
reg[15]=CM-allocate_stack_field(INT_SIZE); 
reg [l 6] = CM-allocate-st ack_field(l ); 

CM— set— context; /* Activate all PEs */ 

CM_u-move-constant-lL(reg[16],0,l); 

printf( M Setting addresses of %d processors . . .”,MAX(N/2,8192)); 
for (1=0; i<MAX(N/2,8192); i++) 
CM-u_write_to-proccssor-lL(i,reg[0],i,INT-SIZE); 
printf( H done!\n"); 

printf(" Activating %d processors . . . M ,N/2); 

CM-uLit-constant_lL(reg[0],N/2,INT-SIZE); /* Select low N/2 PEs */ 
CM-load-context(CM_test_£ag); 

CM_u_move-constant— 1L (reg [l 6] , 1 , 1 ); 
printf( M done!\n"); 

/* Load PEs with data in normal order */ 
printf( ,f Loading data • • 

CM-fLmove-constant-lL(reg[l],0.0,MAN,EXP); 

CMJLmove-constant-lL (reg [2] ,0.0 , MAN, EXP ); 
CM-fLmore-constant-lL(reg[3] ,0.0, MAN, EXP) ; 

CM-f-mo ve_.const an t-1 L (reg [4] , 0 .0 ,MAN ,EXP ) ; 
CM-fLmove^.constant-lL(reg[9] ,0.0, MAN, EXP); 
CM-fLmove-constant-lL(reg(l0],0.0,MAN,EXP); 

printf(”done!\n"); 

fft(n, reg, base, max_dim); /* Execute the FFT */ 


} 

^*«****4******4***4*« M-D DIF FFT Function ******************************/ 
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fft (n ,reg,b ase , dimension) 

int n, reg[l7], base[l5], dimension; 

{ 

int If act or, q, t, i, half-two— i, current-dim, previous-dim; 
double arg; 

CM-timevaLt *cmtvp; 

printf('TSnter loop factor ->"); 
sc anf("%d”,&If actor); 

printin’); 

q= dimension; 

CM-start-timer(l); 

for (t=0; t<lfactor; t-H-) 

{ 

dimensions q; 


for (i=n-l; i>=0; i— ) 

{ 

if (Kbase [dimension]) /* Compute which dimension */ 

dimension^ dimension- 1 ; 
current— dim= b ase [dimension] ; 
preYious_dim=baae[dimension+l] ; 

halfLtwo-i=(l«(i-l)); 

CM-load-context(reg[l6]); /* Enable N/2 PEs */ 

CM-fLadd-3_lL(reg[5],reg[l],reg[3],MAN,EXP); /* Tl=Rl+R2 
CM-f-add-3-lL(reg[6] ,reg [2] ,reg [4] , MAN, EXP ); 
CM-fLsubtract-3-lL(reg[7],reg[l],reg[3],MAN,EXP); /* T2=Rl-R2 
CM-f_jubtract-3-lL(reg[8] ,reg[2] ,reg[4] ,MAN,EXP); 

if (i>0) 

{ 

if (i==current-dim) 

{ 

CM-iLmove-lL(reg[ll],reg[7],MAN,EXP); 

CM-fLmove-lL(reg[l2],reg[8]^lAN,EXP); 

> 

else 

{ 

/* DTRl=T2*W */ 

CM-fLmultiply_3_lL(reg[ll],reg[7],reg[9],MAN,EXP); 
CM-f-multiply-3_lL(reg[12],reg[8],reg[lO],MAN,EXP); 
CM-f_sub tract— 2_1 L (reg [1 1 ] ,reg [l 2 ] , MAN, EXP ) ; 
CM-f_multiply-3-lL (reg [l 2 ] ,reg [8] ,reg [9] ,MAN ,EXP ) ; 
CM-f-multiply-3_lL(reg[l3],reg[7],reg[10],MAN,EXP); 
CM-f-add_2-lL (reg [1 2] ,reg [13] , MAN, EXP ); 

} 

/* Select PEs for BOTTOM butterfly */ 
CM-load-test(reg[0]+i-l); 
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CM-Joad_context(CM_test_flag); 

CM_f_move_lL(reg[7],reg[ll],MAN,EXP); /* T2=DTRl */ 

CM_f_move_lL(reg[8],reg[l2],MAN,EXP); 

CM-f_move_lL(reg[ll],reg[5],MAN,EXP)5 /* DTRl=Tl */ 

CM-f-move-lL(reg[12],reg[6] > MAN,EXP); 

/* Select all N/2 PEs */ 

CM_load_context(reg [l 6] ); 

CM_logxor_constant_3_lL(reg[l5],reg[0],half_two_i,INT_SIZE); 
CM_aend_lL(reg[13],reg[l5],reg[ll],MAN+EXP+l); /* DTR2 <- DTRl */ 
CM_send_lL(reg[l 4] ,reg[15] ,reg[12] ^MAN+EXP +1); 

CM_i_move_lL (reg [1] ,reg [5] , MAN, EXP ); /* Rl=Tl */ 

CM_f_move_lL(reg[2],reg[0j,MAN,EXP); 

CM-f_move.lL (reg [3] , reg [1 3] , MAN, EXP ); /* R2=DTR2 */ 

CMJlmove_lL(reg[4],reg[l4],MAN,EXP); 

/* Select PEs for BOTTOM butterfly */ 

CMLload_test(reg [0] +i-l ); 

CM-load— eontext(CM-test-flag); 

CM_fLmove_lL(reg[l],reg[13],MAN t EXP); /* Rl=DTR2 •/ 

CM-f.moveL.lL (reg [2] , reg [1 4] , MAN, EXP); 

CMJLmove_lL(reg[3j,reg[7],MAN^XP); /* R2=T2 */ 

CM..f-move-lL(rcg[4],reg[8] > MAN,EXP); 

> 

else 

{ 

CMJLmove_lL(reg[l],reg[5l,MAN,EXP); /* Rl=»Tl */ 

CM_f_move_lL(reg[2] ,reg[6] ,MAN,EXP); 

CMJLmove_lL(reg[3],reg[7]^IAN,EXP); /* R2=T2 */ 

CM_f_move_lL(reg[4],reg[8} > MAN,EXP); 

} 

} 

} 

cmtvp= CM-stop.timer(O); 

i=(l<<n); 

printf(" CM_fft_t_2.c Paris Version 5.0B\n"); 

printf("\nTimes computed for a single %d-point %d-dimension complex FFT.\n',i,q); 
printf("Times based on a loop factor of %d and using %d PEs.\n‘',lfactor4/2); 
printf("Program uses table look-up for twiddle factor generation.\n"); 
printf("Floating point hardware is ACTrVE!\n M ); 

printf( H - \ n w ); 

printf("Real time: %g sec; CM time: %g sec; CM utilisation %g%%\n'\ 
cm tvp->cmtv-jeal/lf actor, cmtvp->cmtv_cm/lfactor, 
cmtvp->cmtv-real > 0.0 ? 

100.0 • cmtrp->cmtv.cm / cmtTp->cmtv«real:0.0); 
printf( ,r Using a SNlogN counter, %g MFLOPS\n ,, ,.000005*i*n/cmtvp->cmtv_cm*lfactor); 

CM_reset-timer(); 

} 
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CM-JFt—cL-l . c 


CMJfLd-Lc - R. Kamin - Sept. 1, 1988 
Paris Version 5.0B 

This program performs a multidimensional DIF FFT on N 
complex points using N processors. The data is loaded 
in by the front end processor in normal order and read 
back in bit-reversed order. The twiddle factors 
are computed in-line by the CM-2 processors. 


Use <cc XXX.c -lparis -1m -o fft> to compile 


* Note: Must use 32-bit floating point to use FPU 

v 

#include <math.h> 

# include <stdio.h> 

#include <cm/paris.h> 

#define PI 3.14159265358979323846 
#define MAX-N 32769 

#define MAN 23 /* Length of mantissa; D=52 S=23 */ 

#define EXP 8 /* Length of exponent; D=I1 S=8 */ 

#define INT-SIZE MAN+EXP+1 /• Length of integer */ 

#define MAX(A, B) ((A) > (B) ? (A) : (B)) 


main () 

{ 

int max-dim, N, reg[lO], wdsize, i, n, base[l5], size[I5]; 
CM_init(); 

printf("Enter the number of dimensions ->”); 
scanf(”%d",&max_dim); 

base[0]=0; 


for (i=l; i<=max_dim; i-H-) 

{ 

printf(”Size of dimension %d (in log base 2) 

scanf(”%d”,&size[ij); 

base[i]=base[i-l]+size[i]; 

} 


n=base[max_dim]; 

N=(l«n); 
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/* Set up the static memory in each PE 

• 

* reg[0] - Self-address 

* reg[l] - First data register (REAL PART) 

* reg(2] - First data register (COMPLEX PART) 

* reg[3] - Second data register (REAL PART) 

* reg[4] - Second data register (COMPLEX PART) 

* reg[5] - 1-bit context flag for PEs < N 

* reg[6] - Destination address register 

* reg [7] - Temp register 

7 

wdsise=MAN+EXP+l; 

reg [0] = CM_allocate-stack-field (INT-SIZE); 
reg[l]=CM_aUocate— stack-field (wdsise); 
reg [2] — CM_allocate-st ack_field (wdsiae); 
reg[3]=CM-allocate-stack-field(wdsize); 
reg [4] = CM_allocate_stack_field(wdsiie); 
reg[5]=CM-allocate_stack-field(l); 
reg[6]=CM_aUocate-stack_field(INT_SIZE); 
reg[7]=CM-allocate-stack_field(wdsiae); 

CM_set-context; /* Activate all PEs */ 

CM-u_move-constant_lL(reg[5] ,0,1); 

printf( M Setting addresses of %d processors . . .”,MAX(N,8192)); 
for (i=0; i<MAX(N,8192); i++) 
CM_u-write-to_processor_lL(i,reg[0],i,INT-SIZE); 
printf("done!\n M ); 

printf( H Activating %d processors . . . H ,N); 

CM_u_lt-constant_lL(reg[0] ,N, INT-SIZE); /* Select low N PEs */ 

CM-load_context(CM_ test-flag); 

CM-u_move-constant-lL(reg[5],l,l); 

printf("donel\n M ); 

/* Load PEs with data in normal order */ 

printf("Loading data . . ."); 

CM-fLmove-constant.lL (reg [l] ,0.0, MAN, EXP ); 
CM-f-move-constant-lL(reg[2],0.0,MAN,EXP); 
CM-flmove-constant_lL(reg[3],0.0,MAN,EXP); 
CM-f_move-constant-lL(reg[4],0.0,MAN,EXP); 

CM-f-move-const ant_lL (reg [7 ] ,0.0 , MAN ,EXP ); 
printf( H donei\n"); 

fft(n, reg, base, max-dim); /* Execute the FFT */ 


} 

/*********************** M-D DIF FFT Function ******************************/ 

fft(n,reg, base, dimension) 

int n, reg[10], base[l5], dimension; 
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{ 

int If actor, q, t, i, two— i, mask, current-dim, previous-dim; 
double arg; 

CM_timevaL_t *cmtvp; 

printf("Enter loop factor 
sc anf("%d",&lf actor); 
printf("\n”); 
q= dimension; 

CM-start-timer(l); 

for (t=0; t<lfactor; t++) 

{ 

dimensions q; 

for (i=n-l; i>**0; i~) 

{ 

if (i<base[dimension]) 
dimensions dimension- 1; 
current— dim=ba8e{dimension] ; 
previous— dimsbase[dimension-f-l]; 

masks (1 « (previous_dim-current_ dim-l))-l ; 

/* Exchange data between PEs */ 

/* actually doing CUBE(i) */ 

tWO_is(l«i); 

CM_u_ move_always_lL(CM_context_flag,reg[5],l); 
CM_logxor_constant_3_lL(reg[6],reg[0],two_i,INT_SIZE); 

CM_send_lL(reg[3] ,reg[8],reg[l] .MAN+EXP+ 1); 

CM_send_lL(reg(4] ,reg[8] ,reg[2] .MAN+EXP+l); 

/* Select PEs for TOP butterfly */ 

CM-U— eq_sero— 1L (reg [0] +i,l); 

CM_load_context(CM_ test-flag); 

I* DO X=A+B */ 

CMJLadd-2-lL(reg[l],reg[3],MAN^XP); 

Ch4_£_add_2_lL(reg[2],reg[4]AIAN,EXP); 

/* Select PEs for BOTTOM butterfly */ 
CM_u_move_always_lL(CM_context_flag,reg[5],l); 

CM_load_test(reg[0]+i); 

CM_load_context(CM_test_flag); 

r DO Xs(A-B) V 

CM_f_subtract_2_lL(reg[l],reg[3],MAN,EXP); 

CM-f-su b tract_2_lL (reg [2] , reg [4] ,MAN,EXP ); 

if (i>current-dim) 

{ 

arg= -2*PI/(l<<(p revious_dim-cur rent-dim ) ) ; 

CM_u_multiply_constant-3_lL(reg[6],reg[0],(l«(previous-dim-i-l)),INT_SIZE); 

CM_logand_constant_2_lL(reg[8j,mask,INT_SIZE); 
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CM_JLu_float_2_2L (reg [3] ,reg [6] ,INT_SIZE ,MAN,EXP ) ; 
CM_Jlmultiply_constant_2_lL (reg [3] ,arg,MAN,EXP ); 
CM-JLsin-2-lL( r eg[4],r e g[3l,MAN,EXP); 

CNLJLcoa-l-lL (reg [3] , MAN, EXP); 

/* Multiply by twiddle fact.*/ 

CM-JLmultiply-3-lL (reg [7 ] ,reg [2] ,reg [4] , MAN, EXP ) ; 

CM-J_multiply_2_lL (reg [2] ,reg [3] ,MAN,EXP ); 
CM-f-muItiply-2^lL(reg[3],reg[l],MAN,EXP); 

CMJLmultiply-2_lL(reg[l] ,reg[4] ,MAN,EXP); 

CM*-fladd-2_lL (reg [2] ,r eg [l] ,MAN ,EXP ) ; 
CM_f_subtract_3_lL(reg[l],reg[3],reg(7],MAN,EXP); 

} 

} 

} 

cmtvp=CM^top_timer(0); 

i==(l<<n); 

printf(" CMJFL(Ll.c Paris Version 5.0B\n M ); 

printf("\nTimes computed for a single %d-point %d-dimension complex FFT\n”,i,q); 
printf("Times based on a loop factor of %d and using %d PEs\n M ,lfactor,i); 
printf( M Twiddle factors are computed in-line.\n"); 
printf(”Floating point hardware is ACTIVE !\n"); 

printf(" ; \n"); 

printf(”Real time: %g sec; CM time: %g sec; CM utilisation %g%%\ri' t 
cm typ->cm tr-real/lf actor, cm tvp->cmtY_cm /If actor, 
cmtvp->cmtv_real > 0.0 ? 

100.0 * cmtvp*>cmtY_cm / cmtrp->cmtv_real:0.0); 
printf(”Using a SNlogN counter, %g MFLOPS\n M ,.000005*i*n/cmtvp->cmtv_cm*lfactor); 

CM_reset_timer(); 

} 
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CM— Eft— d— 2. c 


CMJfLd_2.c - R. Kamin - October 22, 1988 
Paris Version 5.0B 

This program performs a multidimensional DEF FFT on N 
complex points using N/2 processors* The data is loaded 
in by the front end processor in normal order and read 
back in bit-reversed order. The twiddle factors 
are computed in-line by the CM-2 processors. 


Use <cc XXX.c -Iparis -lm -o fft> to compile 


• Note: Must use 32-bit format to take advantage of FPU 

7 

^include <math.h> 

#include <stdio.h> 

#inelude <cm/paris.h> 

#define PI 3.14159265358979323846 
#define MAX_N 32769 

#define MAN 23 /* Length of mantissa; D=52 S=23 * j 

#define EXP 8 /* Length of exponent; D=ll S=8 */ 

#define INT-SIZE MAN+EXP+1 /• Length of integer */ 

#define MAX(A, B) ((A) > (B) ? (A) : (B)) 


main () 

{ 

int max-dim, N, reg[l8], wdsise, i, n, base[15], size [15]; 
CM-initO; 

printff’Enter the number of dimensions •>"); 
scanf(”%d H ,&max-dim); 

base[0]=0; 


for (i=l; i<=max_dim; i-H*) 

{ 

printf("Sise of dimension %d (in log base 2) ->",i); 

scanf(”%d” ,&sise [i]); 

base[i]=base[i-l]+sise(i]; 

} 


n=base [max-dim] ; 
N— (l«n); 
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/* Set up the static memory in each PE 
* 

* reg[0] - SR = Self-address Register 

* reg[l&2] - Rl = Complex Register #1 

* reg[3&4] - R2 = Complex Register #2 

* reg[5&6] - T1 = Temporary Complex Register #1 

* reg[7&8] - T2 = Temporary Complex Register #2 

* reg[9&10] - W = Complex Twiddle F actor Register 

* reg[ll&12] - DTRl = Complex Data Transfer Register #1 

* reg[l3&14] - DTR2 = Complex Data Transfer Register #2 

* reg[l5] • DR — Destination address Register 

* reg[l6] - CFS = Context Flag storage register 

* reg[l7] - TMP = Temporary integer register 

7 

wdsiie=MAN+EXP+l; /* Floating point number length */ 

reg[0f— CM-allocate-fltack-Jield(INT_SIZE); 
for (i=l; i<15; i++) 
reg[i]=CM_allocate-3tack-field(wdsiae); 
reg [l 5] =CM_allocate_ stack-field (INT— SIZE); 
reg [l 6] = CM-allocate_stack-field ( 1 ) ; 
reg [17] = CM-allocate-stack— field (INT-SIZE); 

CM_set_context; /* Activate all PEs */ 

CM_u-move_constant-lL(reg[16] ,0,1); 

printf( H Setting addresses of %d processors • * . M ,MAX(N/2,8192)); 
for (i=0; i<MAX(N/2,8192); i++) 

CM-u-write-to-processor_lL(i,reg[0],i, INT-SIZE); 
printf( H done!\n"); 

printf( H Activating %d processors . . .",N/2); 

CM-u_lt_constant-lL(reg[0],N/2, INT-SIZE); /* Select low N/2 PEs */ 

CM-load_context(CM-test_flag); 

CM-u_move-constant-lL(reg [1 6] ,1 , 1); 
printf("done!\n"); 

/* Load PEs with data in normal order */ 

printf(”Loading data . . ."); 

CM-fLmove-constant-lL(reg [l] ,0.6, MAN, EXP); 
CMJLmove-constant-lL(reg[2] ,0.0, MAN, EXP); 

CM_f-move-conat ant-lL (r eg [3 ] , 0 . 0 ,MAN ,EXP ) ; 
CM-f_move-constant_lL(reg[4] ,0.0, MAN, EXP); 
CM-fLmiove-constant-lL(reg[9],0.0,MAN3XP); 
CM_f-move_constant_lL(reg[10],0.0,MAN,EXP); 

printf("done!\n”); 

fft(n, reg, base, max-dim); /* Execute the FFT */ 


} 
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/4******************* jyj.j) j)ip FFT Function ******************************/ 

fft(n, reg, base, dimension) 

int n, reg[18], base[l5], dimension; 

{ 

int If actor, q, t, i, half_two_i, mask, current-dim, previous-dim; 
double arg; 

CM-timevaL-t *cmtvp; 

printf("Enter loop factor ->"); 
scanf("%d’*,&lf actor); 
printf("\n*'); 

q= dimension; 

CM-start_timer ( 1 ) ; 

for (t=0; tClfactor; t++) 

{ 

dimension^ q; 


for (issn-1; i>=0; i— ) 

{ 

if (i<base[dimension]) /* Compute which dimension */ 

dimensions dimension- 1 ; 
current-dim = b ase [dimension] ; 
previous— dims b ase [dimension+1] ; 

masks(l«(previous-dim-current-dim-l))-l; 

halfLtwo-is (1« (i-1 )); 

CM-load_context(reg[18]); /* Enable N/2 PEs */ 

CM_f_add_3-lL(reg [5] ,reg[l] ,reg[3] ,MAN,EXP); /* Tl=Rl+R2 */ 

CM_f-add_3_lL(reg[8],reg[2],reg[4],MAN,EXP); 

CM-fLsubtract-3_lL(reg[7],reg[l],reg[3],MAN,EXP); /* T2sRi-R2 */ 
CM-fLsubtract_3-lL(reg[8] ,reg[2] ,reg[4] ,MAN,EXP); 

if (i>0) 

{ 

if (i==current_dim) 

{ 

CM-fLmove-lL (reg [1 1] ,reg [7 ] ,MAN ,EXP ) ; 

CM .,f-move-lL(reg[l2] ,reg[8] , MAN, EXP); 

} 

else 

{ 

args -2*PI/(l<<(previoua-dim-current_dim)); 

/* Compute W */ 

CM-logand-constantw3-lL(reg[l7],reg[0],mask,INT_SIZE); 
CM-u-add-constant-3-lL(reg[l5],reg[17],l«(i-current_dim),INT-SIZE); 
CM-logior-2-lL (reg [l 5] ,reg [17] ,INT_SIZE); 

CM_u-multiply-constant_2-lL(reg[15],(l«(previous-dim-i-l)),INT-SIZE); 
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CM_logand_constant_2_ lL(reg[l5],mask,ENT_SIZE); 
CM_f_u_float_2_2L(reg[9],reg[15],INT_SIZE,MAN,EXP); 
CM_f_multiply_constant_2_lL(reg[9],arg,MAN,EXP); 
CM_f_ain_2_lL(reg[10],reg[9],MAN,EXP); 

CM_f_cos_l_lL (reg [9] , MAN, EXP); 

/* DTR1=T2*W */ 

CM_f_multiply_3_lL(reg[ll],reg[7],reg[9],MAN,EXP); 
CM_f_multiply_3_lL (reg [1 2] ,reg [8] , reg [10] , MAN, EXP ); 

CM-f-subtr act-2-lL (reg [1 1] ,reg [12] , MAN, EXP ); 
CMJLmultiply_3.lL(reg[l 2] ,reg[8] ,reg [9] , MAN, EXP ); 
CM-f-multiply.3.lL(reg[13] ,reg [7],reg [10] , MAN, EXP ); 
CM_f_add_2_lL(reg[12],reg[13],MAN,EXP); 

} 

/* Select PEs for BOTTOM butterfly */ 

CM_load_test(reg [0] +i- 1 ); 

CM_load_context(CM_test_flag); 

CM_f_moYe_lL(reg[7],reg[ll],MAN,EXP); /* T2=DTRl */ 

CM_f_moTe_lL(reg[8] ,reg[12] , MAN, EXP); 

CM_f^move_lL(reg [1 1] ,reg [5] , MAN, EXP ); /* DTRl=Tl */ 

CM-f-move— 1L (regfl 2] ,reg[6 j , MAN, EXP); 

/• Select all N/2 PEs 7 

CM_load_context(reg[lfl]); 

CM_logxor_constant_3_lL(reg[l5],reg[0],half_two_i,INT_SIZE); 
CM_send_lL(reg[l3],reg[l5],reg[ll]^tAN+EXP+l); /* DTR2 <- DTRl */ 
CM_3end_lL(reg[l4],reg[l5],reg[l2],MAN+EXP+l); 


CM_£_moYe_lL(reg[l],reg[5],MAN,EXP); /* Rl=Tl */ 

CM_fLmoYe_lL(reg[2],reg[6],MAN,EXP); 

CM_f_move_lL(reg[3],reg[l3],MAN,EXP); /* R2=DTR2 */ 

CM_f- move-lL (reg [4] ,reg[14],MAN,EXP); 


/• Select PEs for BOTTOM butterfly */ 
CM_load_test(reg[0]+i-l); 

CM-load— context(CM_test_flag); 


CM_f_move_lL(reg[l],reg[l3],MAN,EXP); 
CM_f_inoYe_lL(reg[2],reg[l4],MAN,EXP); 
CM_f_more_lL(reg[3] ,reg[7],MAN,EXP); 
CM_f_move_lL(reg [4] ,reg [8] , MAN, EXP ); 

} 

else 

{ 

CM_fLmove_lL(reg[l],reg[5],MAN,EXP); 
CM_f_move_lL(reg[2],reg[6],MAN,EXP); 
CM— f_move_lL(reg[3],reg[7],MAN,EXP); 
CM_f_moy e_lL (reg [4] ,reg [8] , MAN, EXP); 

} 

} 

} 

cmtvp=CM_stop_timer(0); 

i=(l«n); 


/* R1«=DTR2 7 
I* R2=T2 7 

/* R1=T1 7 

/* R2=T2 7 
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printf(" CM-JFt_d_2.c Paris Version 5.0B\n"); 

printf("\nTimes computed for a single 96d-point %d-dimension complex FFT\n”,i,q); 
printf( M Times based on a loop factor of %d and using %d PEs\n”,lfactor,i/2); 
printf(”Twiddle factors are computed in-line.\n"); 
printf(”FIoating point hardware is ACTIVE!\n”); 

printf( H \n”); 

printf(”Real time: %g sec; CM time: %g sec; CM utilisation %g%%\n", 
cm tvp->cmtv_real/lf actor, cmtvp->cmtv_cm /If actor, 
cmtvp->cmtv_real > 0.0 ? 

100.0 * cmtvp->cmtv_cm / cmtvp->cmtv_real:0.0); 
printf( M Using a SNlogN counter, %g M?LOPS\n* , ,.000005*i*n/cmtvp->cmtv_cm*lfactor); 

CM_reset_timer(); 

} 



